Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

fix: the problem that the pending tasks cannot be scheduled during the backfill action #4029

Merged
merged 1 commit into from
Feb 28, 2025

Conversation

hansongChina
Copy link
Contributor

What type of PR is this?

What this PR does / why we need it:

During the backfill action, in the loop of pending tasks, if a previous task fails to match a node or an exception occurs when calling the PrePredicateFn method, all subsequent tasks will stop being scheduled.

Which issue(s) this PR fixes:

Fixes #4028

Special notes for your reviewer:

Does this PR introduce a user-facing change?


@volcano-sh-bot volcano-sh-bot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 21, 2025
@volcano-sh-bot volcano-sh-bot added area/scheduling size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Feb 21, 2025
Copy link
Member

@hwdef hwdef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Feb 24, 2025
@hwdef
Copy link
Member

hwdef commented Feb 24, 2025

/ok-to-test

@volcano-sh-bot volcano-sh-bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed lgtm Indicates that a PR is ready to be merged. labels Feb 24, 2025
@JesseStutler
Copy link
Member

JesseStutler commented Feb 25, 2025

Great. But I'm wondering isn't there also a problem in the allocate action? After the break here:

, the job is not pushed back, and the remaining pods in this job are not scheduled

cc @Monokaix @lowang-bh @hwdef

@lowang-bh
Copy link
Member

lowang-bh commented Feb 25, 2025

Great. But I'm wondering isn't there also a problem in the allocate action? After the break here:

, the job is not pushed back, and the remaining pods in this job are not scheduled
cc @Monokaix @lowang-bh @hwdef

It is ok when minMember equals to total replicas.
Otherwise, maybe it need to be considered as this:

if job.NeedContinueAllocating() {

@Monokaix
Copy link
Member

Great. But I'm wondering isn't there also a problem in the allocate action? After the break here:

, the job is not pushed back, and the remaining pods in this job are not scheduled
cc @Monokaix @lowang-bh @hwdef

It is ok when minMember equals to total replicas. Otherwise, maybe it need to be considered as this:

if job.NeedContinueAllocating() {

We can seperate it to another pr.

@Monokaix
Copy link
Member

Please squash to one commit.

@JesseStutler
Copy link
Member

Great. But I'm wondering isn't there also a problem in the allocate action? After the break here:

, the job is not pushed back, and the remaining pods in this job are not scheduled
cc @Monokaix @lowang-bh @hwdef

It is ok when minMember equals to total replicas. Otherwise, maybe it need to be considered as this:

if job.NeedContinueAllocating() {

We can seperate it to another pr.

I think using NeedContinueAllocating is a good choice, Could you also help to fix allocate action? @hansongChina

@Monokaix
Copy link
Member

NeedContinueAllocating()

allocate is a more sensitive action, add this logic may cause some performance issue, maybe we should reconsider that case: )

@hansongChina
Copy link
Contributor Author

Please squash to one commit.

OK

@hansongChina
Copy link
Contributor Author

Great. But I'm wondering isn't there also a problem in the allocate action? After the break here:

, the job is not pushed back, and the remaining pods in this job are not scheduled
cc @Monokaix @lowang-bh @hwdef

It is ok when minMember equals to total replicas. Otherwise, maybe it need to be considered as this:

if job.NeedContinueAllocating() {

We can seperate it to another pr.

I think using NeedContinueAllocating is a good choice, Could you also help to fix allocate action? @hansongChina
Sure. Should we use the method "NeedContinueAllocating" to make a judgment once, or directly use "continue" to let all the tasks of this job attempt to be scheduled once?

…e backfill action

Signed-off-by: hansong <252671213@qq.com>
@Monokaix
Copy link
Member

/approve

@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Monokaix

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 26, 2025
@JesseStutler
Copy link
Member

/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Feb 28, 2025
@volcano-sh-bot volcano-sh-bot merged commit cdfb83c into volcano-sh:master Feb 28, 2025
17 checks passed
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/scheduling kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

the problem that the pending tasks cannot be scheduled during the backfill action
6 participants